optimality operator
Reviews: A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation
Summary: The paper proposes a new algorithm for learning linear preferences, which are objectives derived from a linear weighting of a vector reward function, in multi-objective reinforcement learning (MORL). The proposed algorithm achieves this by performing updates that use the convex envelope of the solution frontier to update the parameters of the action-value function, hence its name: envelop Q-learning. This is done by first defining a multi-objective version of the action-value function along with a pseudo-metric, the supremum over the states, actions, and preferences. Then, a Bellman operator is defined for the multi-objective action-value function along with an optimality filter, which together define a new optimality operator. Using all these definitions, the paper then shows three main theoretical results: 1) the optimality operator has a fixed point that maximizes the amount of reward under any given preference, 2) the optimality operator is a contraction, and 3) for any Q in the pseudo-metric space, iterative applications of the optimality operator will result in an action-value function which distance to the fixed point is equal to zero, i.e. is equivalent to the fixed point.
Accelerating Value Iteration with Anchoring
Value Iteration (VI) is foundational to the theory and practice of modern reinforcement learning, and it is known to converge at a $\mathcal{O}(\gamma^k)$-rate, where $\gamma$ is the discount factor. Surprisingly, however, the optimal rate for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an \emph{anchoring} mechanism (distinct from Nesterov's acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits a $\mathcal{O}(1/k)$-rate for $\gamma\approx 1$ or even $\gamma=1$, while standard VI has rate $\mathcal{O}(1)$ for $\gamma\ge 1-1/k$, where $k$ is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of $4$, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss--Seidel VI setups as well.
PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm
Basaklar, Toygun, Gumussoy, Suat, Ogras, Umit Y.
Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches. The main objective in a standard RL setting is to obtain a policy that maximizes a single cumulative reward by interacting with the environment. However, many real-world problems involve multiple, possibly conflicting, objectives. For example, robotics tasks should maximize speed while minimizing energy consumption. In contrast to single-objective environments, performance is measured using multiple objectives. Consequently, there are multiple Pareto-optimal solutions as a function of the preference between objectives (Navon et al., 2020). Multi-objective reinforcement learning (MORL) approaches (Hayes et al., 2022) have emerged to tackle these problems by maximizing a vector of rewards depending on the preferences. Existing approaches for multi-objective optimization generally transform the multidimensional objective space into a single dimension by statically assigning weights (preferences) to each objective (Liu et al., 2014).
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > New Jersey > Mercer County > Princeton (0.04)
- Information Technology (0.45)
- Energy (0.34)
Multi-Objective Influence Diagrams with Possibly Optimal Policies
Marinescu, Radu (IBM, Dublin) | Razak, Abdul (University College Cork) | Wilson, Nic (University College Cork)
The formalism of multi-objective influence diagrams has recently been developed for modeling and solving sequential decision problems under uncertainty and multiple objectives. Since utility values representing the decision maker's preferences are only partially ordered (e.g., by the Pareto order) we no longer have a unique maximal value of expected utility, but a set of them. Computing the set of maximal values of expected utility and the corresponding policies can be computationally very challenging. In this paper, we consider alternative notions of optimality, one of the most important one being the notion of possibly optimal, namely optimal in at least one scenario compatible with the inter-objective tradeoffs. We develop a variable elimination algorithm for computing the set of possibly optimal expected utility values, prove formally its correctness, and compare variants of the algorithm experimentally.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Ireland > Munster > County Cork > Cork (0.04)